import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
Load the dataset
df = pd.read_csv('Loan.csv')
Missing values analysis
df.isnull().sum()
| Column | Missing values |
|---|---|
| ApplicationDate | 0 |
| Age | 0 |
| AnnualIncome | 0 |
| CreditScore | 0 |
| EmploymentStatus | 0 |
| EducationLevel | 0 |
| Experience | 0 |
| LoanAmount | 0 |
| LoanDuration | 0 |
| MaritalStatus | 0 |
| NumberOfDependents | 0 |
| HomeOwnershipStatus | 0 |
| MonthlyDebtPayments | 0 |
| CreditCardUtilizationRate | 0 |
| NumberOfOpenCreditLines | 0 |
| NumberOfCreditInquiries | 0 |
| DebtToIncomeRatio | 0 |
| BankruptcyHistory | 0 |
| LoanPurpose | 0 |
| PreviousLoanDefaults | 0 |
| PaymentHistory | 0 |
| LengthOfCreditHistory | 0 |
| SavingsAccountBalance | 0 |
| CheckingAccountBalance | 0 |
| TotalAssets | 0 |
| TotalLiabilities | 0 |
| MonthlyIncome | 0 |
| UtilityBillsPaymentHistory | 0 |
| JobTenure | 0 |
| NetWorth | 0 |
| BaseInterestRate | 0 |
| InterestRate | 0 |
| MonthlyLoanPayment | 0 |
| TotalDebtToIncomeRatio | 0 |
| LoanApproved | 0 |
| RiskScore | 0 |
There are no missing values in the dataset.
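A long column of zeros is easy to misread on a wide frame. As a minimal sketch, the helper below reports counts and percentages only for columns that actually contain gaps, so an empty result means the data is complete. The name `missing_summary` and the toy frame are illustrative and not part of the notebook.

```python
import numpy as np
import pandas as pd

def missing_summary(frame: pd.DataFrame) -> pd.DataFrame:
    """Return count and percentage of missing values, for columns with gaps only."""
    counts = frame.isnull().sum()
    summary = pd.DataFrame({
        "missing": counts,
        "percent": counts / len(frame) * 100,
    })
    return summary[summary["missing"] > 0].sort_values("missing", ascending=False)

# Toy frame with made-up values (not taken from Loan.csv)
toy = pd.DataFrame({
    "Age": [25, 40, np.nan, 31],
    "CreditScore": [610, 700, 655, 580],
    "AnnualIncome": [np.nan, np.nan, 52000, 61000],
})
summary = missing_summary(toy)
print(summary)
```

Calling `missing_summary(df)` on the loan data should return an empty frame if the `isnull().sum()` output above is accurate.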
Summary
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20000 entries, 0 to 19999
Data columns (total 36 columns):
 #   Column                      Non-Null Count  Dtype
---  ------                      --------------  -----
 0   ApplicationDate             20000 non-null  object
 1   Age                         20000 non-null  int64
 2   AnnualIncome                20000 non-null  int64
 3   CreditScore                 20000 non-null  int64
 4   EmploymentStatus            20000 non-null  object
 5   EducationLevel              20000 non-null  object
 6   Experience                  20000 non-null  int64
 7   LoanAmount                  20000 non-null  int64
 8   LoanDuration                20000 non-null  int64
 9   MaritalStatus               20000 non-null  object
 10  NumberOfDependents          20000 non-null  int64
 11  HomeOwnershipStatus         20000 non-null  object
 12  MonthlyDebtPayments         20000 non-null  int64
 13  CreditCardUtilizationRate   20000 non-null  float64
 14  NumberOfOpenCreditLines     20000 non-null  int64
 15  NumberOfCreditInquiries     20000 non-null  int64
 16  DebtToIncomeRatio           20000 non-null  float64
 17  BankruptcyHistory           20000 non-null  int64
 18  LoanPurpose                 20000 non-null  object
 19  PreviousLoanDefaults        20000 non-null  int64
 20  PaymentHistory              20000 non-null  int64
 21  LengthOfCreditHistory       20000 non-null  int64
 22  SavingsAccountBalance       20000 non-null  int64
 23  CheckingAccountBalance      20000 non-null  int64
 24  TotalAssets                 20000 non-null  int64
 25  TotalLiabilities            20000 non-null  int64
 26  MonthlyIncome               20000 non-null  float64
 27  UtilityBillsPaymentHistory  20000 non-null  float64
 28  JobTenure                   20000 non-null  int64
 29  NetWorth                    20000 non-null  int64
 30  BaseInterestRate            20000 non-null  float64
 31  InterestRate                20000 non-null  float64
 32  MonthlyLoanPayment          20000 non-null  float64
 33  TotalDebtToIncomeRatio      20000 non-null  float64
 34  LoanApproved                20000 non-null  int64
 35  RiskScore                   20000 non-null  float64
dtypes: float64(9), int64(21), object(6)
memory usage: 5.5+ MB
The dtypes match the column contents, with one exception: ApplicationDate is stored as object (string) rather than as a datetime.
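Converting ApplicationDate to a datetime dtype takes one call to `pd.to_datetime`. The sketch below runs on a toy frame and assumes ISO-style `YYYY-MM-DD` strings; the notebook does not show the actual date format in Loan.csv, so the `format` argument may need adjusting.

```python
import pandas as pd

# Toy frame; the ISO date strings are an assumption about the file's format
toy = pd.DataFrame({"ApplicationDate": ["2018-01-01", "2018-01-02", "not a date"]})

# errors="coerce" turns unparseable strings into NaT instead of raising
toy["ApplicationDate"] = pd.to_datetime(
    toy["ApplicationDate"], format="%Y-%m-%d", errors="coerce"
)

print(toy.dtypes)
print(toy["ApplicationDate"].isna().sum(), "value(s) could not be parsed")
```

Counting the `NaT` values after conversion is a cheap way to notice a wrong format assumption before it affects later analysis.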
df['LoanApproved'] = df['LoanApproved'].map({1: 'Approved', 0: 'Not Approved'})
Correlation Matrix
# Select numeric columns
numeric_cols = df.select_dtypes(include=['int64', 'float64']).columns
# Calculate the correlation matrix for numeric columns
corr_matrix = df[numeric_cols].corr()
# Plot the heatmap
plt.figure(figsize=(10, 8))
sns.heatmap(corr_matrix, cmap='coolwarm', square=True)
plt.title('Correlation Matrix')
plt.show()
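With roughly thirty numeric columns and no annotations, the heatmap shows broad patterns but not specific values. One way to read off the strongest relationships is to list the column pairs with the largest absolute correlation. This is a sketch on a toy frame; the helper name `top_correlations` is hypothetical.

```python
import numpy as np
import pandas as pd

def top_correlations(corr: pd.DataFrame, n: int = 5) -> pd.Series:
    """Return the n column pairs with the largest absolute correlation."""
    cols = corr.columns
    # Indices of the strict upper triangle, so each pair appears exactly once
    row_idx, col_idx = np.triu_indices(len(cols), k=1)
    pairs = pd.Series(
        corr.to_numpy()[row_idx, col_idx],
        index=pd.MultiIndex.from_arrays([cols[row_idx], cols[col_idx]]),
    )
    # Sort by magnitude but keep the sign in the returned values
    return pairs.sort_values(key=lambda s: s.abs(), ascending=False).head(n)

# Toy numeric frame with made-up values (not taken from Loan.csv)
toy = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0, 5.0],
    "b": [2.0, 4.0, 6.0, 8.0, 10.0],  # exactly 2 * a
    "c": [5.0, 3.0, 4.0, 1.0, 2.0],
})
top = top_correlations(toy.corr(), n=2)
print(top)
```

On the loan data the call would be `top_correlations(corr_matrix, n=10)`, using the `corr_matrix` computed above.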
Data Distribution
df['EmploymentStatus'].value_counts().plot(kind = 'pie', autopct = '%.2f%%')
plt.show()
df['LoanApproved'].value_counts().plot(kind = 'pie', autopct = '%.2f%%')
plt.show()
df['EducationLevel'].value_counts().plot(kind = 'pie', autopct = '%.2f%%')
plt.show()
df['LoanPurpose'].value_counts().plot(kind = 'pie', autopct = '%.2f%%')
plt.show()
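Pie slices of similar size are hard to compare by eye, so it can help to print the same shares as a table. A minimal sketch with `value_counts(normalize=True)`, using a toy column whose labels are made up for illustration:

```python
import pandas as pd

# Toy column with made-up labels (not taken from Loan.csv)
status = pd.Series(
    ["Employed", "Employed", "Employed", "Self-Employed", "Unemployed"],
    name="EmploymentStatus",
)

# normalize=True returns fractions; scale to percent and round for display
shares = (status.value_counts(normalize=True) * 100).round(2)
print(shares)
```

The same expression works for any of the categorical columns plotted above, for example `df['LoanPurpose']`.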
INSIGHTS
- Age vs Loan Amount
Applicants between the ages of 30-45 tend to request higher loan amounts, with approval rates higher in this age group compared to younger or older groups.
The middle-aged demographic often represents financially stable individuals, which lenders may favor.
plt.figure(figsize=(8, 5))
sns.scatterplot(data=df, x='Age', y='LoanAmount', hue='LoanApproved', alpha=0.7)
plt.title('Age vs Loan Amount')
plt.xlabel('Age')
plt.ylabel('Loan Amount')
plt.legend(title='Loan Approved')
plt.show()
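A scatter plot of 20,000 points makes it hard to judge approval rates by age group. The age-band statement can be checked numerically by binning Age and aggregating per band. The sketch below uses a toy frame; the band edges are an illustrative choice, not something the notebook defines.

```python
import pandas as pd

# Toy data with made-up values (not taken from Loan.csv)
toy = pd.DataFrame({
    "Age": [22, 28, 33, 38, 44, 51, 60, 35],
    "LoanAmount": [8000, 12000, 25000, 30000, 27000, 15000, 10000, 22000],
    "LoanApproved": ["Not Approved", "Not Approved", "Approved", "Approved",
                     "Not Approved", "Approved", "Not Approved", "Approved"],
})

# right=False gives half-open bins [18, 30), [30, 46), [46, 100)
bands = pd.cut(
    toy["Age"], bins=[18, 30, 46, 100], right=False,
    labels=["18-29", "30-45", "46+"],
).rename("AgeBand")

by_band = toy.groupby(bands, observed=True).agg(
    applicants=("Age", "size"),
    mean_loan=("LoanAmount", "mean"),
    approval_rate=("LoanApproved", lambda s: (s == "Approved").mean()),
)
print(by_band)
```

Including the applicant count next to each rate shows how much data supports each band.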
- Annual Income Distribution by Education
Graduates tend to have higher incomes, and their loan approval rates are higher across income brackets. Education positively impacts earning potential and eligibility for loans.
plt.figure(figsize=(8, 5))
# split=True draws one half-violin per hue level, so use the two-level LoanApproved column as hue
sns.violinplot(data=df, x='EducationLevel', y='AnnualIncome', palette='muted', split=True, hue='LoanApproved')
plt.title('Annual Income by Education')
plt.xlabel('Education Level')
plt.ylabel('Annual Income')
plt.show()
- Annual Income vs Credit Score
Applicants with higher annual incomes generally have better credit scores, leading to higher loan approval rates. High income and a good credit score reduce the risk for lenders, increasing approval chances.
plt.figure(figsize=(8, 5))
sns.scatterplot(data=df, x='AnnualIncome', y='CreditScore', hue='LoanApproved', alpha=0.7, palette='coolwarm')
plt.title('Annual Income vs Credit Score')
plt.xlabel('Annual Income')
plt.ylabel('Credit Score')
plt.legend(title='Loan Approved')
plt.show()
- Employment Status vs Loan Approval
Full-time employees have significantly higher loan approval rates compared to part-time, unemployed, or self-employed individuals. Lenders prioritize steady income from full-time employment when approving loans.
plt.figure(figsize=(8, 5))
sns.countplot(data=df, x='EmploymentStatus', hue='LoanApproved', palette='viridis')
plt.title('Loan Approval by Employment Status')
plt.xlabel('Employment Status')
plt.ylabel('Count')
plt.legend(title='Loan Approved')
plt.show()
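A count plot shows how many applicants fall in each group, but approval rates are easier to compare as within-group percentages. A minimal sketch with `pd.crosstab` on a toy frame (the category labels are made up):

```python
import pandas as pd

# Toy data with made-up values (not taken from Loan.csv)
toy = pd.DataFrame({
    "EmploymentStatus": ["Employed", "Employed", "Employed", "Employed",
                         "Self-Employed", "Self-Employed",
                         "Unemployed", "Unemployed"],
    "LoanApproved": ["Approved", "Approved", "Not Approved", "Approved",
                     "Approved", "Not Approved",
                     "Not Approved", "Not Approved"],
})

# normalize="index" divides each row by its total, giving within-group shares
rates = pd.crosstab(
    toy["EmploymentStatus"], toy["LoanApproved"], normalize="index"
) * 100
print(rates.round(1))
```

Each row sums to 100, so groups of very different sizes can be compared directly.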
- Credit Score vs Risk Score
Higher credit scores correlate with lower risk scores, resulting in more loan approvals. Credit score is a key indicator for assessing financial risk and eligibility for loans.
plt.figure(figsize=(8, 5))
sns.scatterplot(data=df, x='CreditScore', y='RiskScore', alpha=0.6, hue='LoanApproved', palette='coolwarm')
plt.title('Credit Score vs Risk Score')
plt.xlabel('Credit Score')
plt.ylabel('Risk Score')
plt.legend(title='Loan Approved')
plt.show()
- Marital Status vs Loan Amount
Married applicants tend to apply for higher loan amounts and have higher approval rates compared to single applicants. Marriage might signal financial stability or dual-income households, increasing loan eligibility.
plt.figure(figsize=(8, 5))
sns.boxplot(data=df, x='MaritalStatus', y='LoanAmount', hue='LoanApproved', palette='pastel')
plt.title('Loan Amount by Marital Status')
plt.xlabel('Marital Status')
plt.ylabel('Loan Amount')
plt.legend(title='Loan Approved')
plt.show()
- Loan Purpose vs Loan Amount
Home purchases and business investments generally involve higher loan amounts compared to personal or vehicle loans. Loan purposes like home or business investments require larger capital, reflecting borrower needs.
plt.figure(figsize=(8, 5))
sns.boxplot(data=df, x='LoanPurpose', y='LoanAmount')
plt.title('Loan Amount by Loan Purpose')
plt.xlabel('Loan Purpose')
plt.ylabel('Loan Amount')
plt.show()
- Savings Account Balance vs Loan Approval
Approved applicants typically have higher savings account balances compared to rejected ones. Higher savings balances reduce perceived financial risk, increasing approval likelihood.
plt.figure(figsize=(8, 5))
sns.boxplot(data=df, x='LoanApproved', y='SavingsAccountBalance')
plt.title('Savings Account Balance by Loan Approval')
plt.xlabel('Loan Approved')
plt.ylabel('Savings Account Balance')
plt.show()
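The quartiles that a box plot draws can also be printed, which makes the comparison between approved and rejected applicants concrete. A sketch on toy balances (made-up numbers):

```python
import pandas as pd

# Toy data with made-up values (not taken from Loan.csv)
toy = pd.DataFrame({
    "LoanApproved": ["Approved"] * 5 + ["Not Approved"] * 5,
    "SavingsAccountBalance": [1000, 3000, 5000, 7000, 9000,
                              500, 1000, 1500, 2000, 2500],
})

# One row per approval status, one column per quantile
quartiles = (
    toy.groupby("LoanApproved")["SavingsAccountBalance"]
    .quantile([0.25, 0.5, 0.75])
    .unstack()
)
print(quartiles)
```

Savings balances tend to be right-skewed, so medians are usually a safer basis for this comparison than means.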
- Annual Income vs Loan Amount
Higher annual income correlates with larger loan amounts. Applicants with advanced education levels often apply for higher loans. Income and education level are both strong indicators of loan eligibility and borrowing capacity.
plt.figure(figsize=(8, 5))
sns.scatterplot(data=df, x='AnnualIncome', y='LoanAmount', hue='EducationLevel', alpha=0.7, palette='viridis')
plt.title('Annual Income vs Loan Amount by Education Level')
plt.xlabel('Annual Income')
plt.ylabel('Loan Amount')
plt.legend(title='Education Level')
plt.show()
- Debt-to-Income Ratio vs Credit Score
Applicants with lower debt-to-income ratios typically maintain higher credit scores, regardless of home ownership status. Lower debt burdens are often associated with better financial discipline and creditworthiness.
plt.figure(figsize=(8, 5))
sns.scatterplot(data=df, x='DebtToIncomeRatio', y='CreditScore', hue='HomeOwnershipStatus', palette='coolwarm', alpha=0.6)
plt.title('Debt-to-Income Ratio vs Credit Score by Home Ownership Status')
plt.xlabel('Debt-to-Income Ratio')
plt.ylabel('Credit Score')
plt.legend(title='Home Ownership Status')
plt.show()
- Monthly Income vs Loan Amount
Married applicants with higher monthly incomes tend to apply for and receive larger loan amounts compared to unmarried individuals. Married applicants might have dual incomes, enabling them to handle larger loan obligations.
plt.figure(figsize=(8, 5))
sns.scatterplot(data=df, x='MonthlyIncome', y='LoanAmount', hue='MaritalStatus', alpha=0.7, palette='pastel')
plt.title('Monthly Income vs Loan Amount by Marital Status')
plt.xlabel('Monthly Income')
plt.ylabel('Loan Amount')
plt.legend(title='Marital Status')
plt.show()
- Credit Card Utilization Rate vs Credit Score
Lower credit card utilization rates are strongly associated with higher credit scores across all education levels. Maintaining low credit card utilization demonstrates responsible credit usage, improving credit scores.
plt.figure(figsize=(8, 5))
sns.scatterplot(data=df, x='CreditCardUtilizationRate', y='CreditScore', alpha=0.7, hue='EducationLevel', palette='muted')
plt.title('Credit Card Utilization Rate vs Credit Score by Education Level')
plt.xlabel('Credit Card Utilization Rate')
plt.ylabel('Credit Score')
plt.legend(title='Education Level')
plt.show()
- Bankruptcy History vs Net Worth
Applicants with no history of bankruptcy generally have a higher net worth compared to those who have declared bankruptcy. Bankruptcy negatively impacts wealth accumulation, which can influence future credit decisions.
plt.figure(figsize=(8, 5))
sns.boxplot(data=df, x='BankruptcyHistory', y='NetWorth')
plt.title('Net Worth by Bankruptcy History')
plt.xlabel('Bankruptcy History (Yes=1, No=0)')
plt.ylabel('Net Worth')
plt.show()
- Risk Score vs Credit Score
There’s a strong negative correlation (-0.75) between RiskScore and CreditScore, indicating that higher credit scores reduce financial risk. This relationship highlights how maintaining a good credit score minimizes perceived financial risks for lenders.
plt.figure(figsize=(8, 5))
sns.heatmap(df[['RiskScore', 'CreditScore']].corr(), annot=True, cmap='coolwarm', fmt='.2f', linewidths=0.5)
plt.title('Correlation between Risk Score and Credit Score')
plt.show()
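For a single pair of columns, the coefficient can be read directly with `Series.corr`, without drawing a 2x2 heatmap. The sketch below uses toy values, so its result will not match the figure reported for the loan data.

```python
import pandas as pd

# Toy data with made-up values (not taken from Loan.csv)
toy = pd.DataFrame({
    "CreditScore": [520, 580, 640, 700, 760],
    "RiskScore": [62.0, 55.0, 51.0, 44.0, 40.0],
})

# Series.corr computes the Pearson coefficient by default
r = toy["CreditScore"].corr(toy["RiskScore"])
print(f"Pearson r = {r:.2f}")
```

Passing `method="spearman"` gives a rank-based alternative that is less sensitive to outliers.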
- Average Loan Amount by Marital Status and Employment Status
Married applicants generally apply for higher loan amounts across all employment statuses. Full-time employees, regardless of marital status, request higher loan amounts than part-time or unemployed applicants. This demonstrates how marital and employment statuses jointly influence the financial needs and borrowing behavior.
loan_avg = df.pivot_table(values='LoanAmount', index='MaritalStatus', columns='EmploymentStatus', aggfunc='mean')
plt.figure(figsize=(10, 6))
sns.heatmap(loan_avg, annot=True, fmt=".1f", cmap="YlGnBu", linewidths=0.5)
plt.title('Average Loan Amount by Marital Status and Employment Status')
plt.xlabel('Employment Status')
plt.ylabel('Marital Status')
plt.show()
- Employment Status and Homeownership Status with Risk Scoring
plt.figure(figsize=(10, 6))
pivot_table = df.pivot_table(index='EmploymentStatus', columns='HomeOwnershipStatus', values='RiskScore', aggfunc='mean')
sns.heatmap(pivot_table, annot=True, fmt=".2f", cmap="coolwarm", linewidths=0.5)
plt.title('Risk Score by Employment Status and Homeownership Status')
plt.xlabel('Homeownership Status')
plt.ylabel('Employment Status')
plt.show()
- Education Level and Homeownership Status with Total Debt-to-Income Ratio
plt.figure(figsize=(10, 6))
pivot_table = df.pivot_table(index='EducationLevel', columns='HomeOwnershipStatus', values='TotalDebtToIncomeRatio', aggfunc='mean')
sns.heatmap(pivot_table, annot=True, fmt=".2f", cmap="coolwarm", linewidths=0.5)
plt.title('Total Debt-to-Income Ratio by Education Level and Homeownership Status')
plt.xlabel('Homeownership Status')
plt.ylabel('Education Level')
plt.show()
- Net Worth vs Risk Score
plt.figure(figsize=(8, 5))
sns.scatterplot(data=df, x='NetWorth', y='RiskScore', hue='LoanApproved', alpha=0.7, palette='coolwarm')
plt.title('Net Worth vs Risk Score')
plt.show()
- Net Worth by Loan Approval Status
plt.figure(figsize=(8, 5))
sns.barplot(data=df, x='LoanApproved', y='NetWorth')
plt.title('Net Worth by Loan Approval Status')
plt.xlabel('Loan Approved')
plt.show()
- Credit Card Utilization vs Risk Scoring
plt.figure(figsize=(8, 5))
sns.scatterplot(data=df, x='CreditCardUtilizationRate', y='RiskScore', hue='LoanApproved', alpha=0.7, palette='coolwarm')
plt.title('Credit Card Utilization Rate vs Risk Score')
plt.show()
- Annual Income vs Total Liabilities
plt.figure(figsize=(8, 5))
sns.scatterplot(data=df, x='AnnualIncome', y='TotalLiabilities', hue='LoanApproved', alpha=0.7, palette='coolwarm')
plt.title('Annual Income vs Total Liabilities')
plt.xlabel('Annual Income')
plt.ylabel('Total Liabilities')
plt.legend(title='Loan Approved')
plt.show()
- Total Debt-to-Income Ratio vs. Monthly Loan Payment
plt.figure(figsize=(10, 6))
sns.scatterplot(
    data=df,
    x='MonthlyLoanPayment',
    y='TotalDebtToIncomeRatio',
    hue='LoanApproved',  # Color points by approval status
    alpha=0.6
)
plt.title('Total Debt-to-Income Ratio vs. Monthly Loan Payment')
plt.xlabel('Monthly Loan Payment')
plt.ylabel('Total Debt-to-Income Ratio')
plt.legend(title='Loan Approved')
plt.show()
- Risk Score and Loan Approval Percentage by Bankruptcy History and Employment Status
# Mean risk score for each bankruptcy/employment combination
mean_risk_score = df.pivot_table(
    index='BankruptcyHistory',
    columns='EmploymentStatus',
    values='RiskScore',
    aggfunc='mean'
)
# Work on a copy so the 'Approved'/'Not Approved' labels in df are not overwritten
df1 = df.copy()
df1['LoanApproved'] = df1['LoanApproved'].map({'Approved': 1, 'Not Approved': 0})
# Calculate loan approval percentage
loan_approval_pct = df1.groupby(['BankruptcyHistory', 'EmploymentStatus'])['LoanApproved'].mean().unstack() * 100
# Combine data for annotations; cast to object so the cells can hold strings
annot = mean_risk_score.astype(object)
for i in annot.index:
    for j in annot.columns:
        risk = mean_risk_score.loc[i, j]
        pct = loan_approval_pct.loc[i, j]
        annot.loc[i, j] = f"{risk:.2f}\n({pct:.1f}%)"
# Plotting the heatmap
plt.figure(figsize=(12, 8))
sns.heatmap(
    mean_risk_score,
    annot=annot,
    fmt="",
    cmap="coolwarm",
    linewidths=0.5,
    cbar_kws={'label': 'Mean Risk Score'}
)
plt.title('Risk Score and Loan Approval Percentage by Bankruptcy History and Employment Status')
plt.xlabel('Employment Status')
plt.ylabel('Bankruptcy History (Yes=1, No=0)')
plt.show()
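The two-line cell labels for a heatmap like this one can also be built without a nested loop, by formatting each frame column-wise and concatenating the strings. This also avoids writing strings into a float frame. A sketch on two small toy frames standing in for the mean risk scores and approval percentages (all values made up):

```python
import pandas as pd

# Toy stand-ins with made-up values (not taken from Loan.csv)
risk = pd.DataFrame(
    {"Employed": [48.0, 52.5], "Unemployed": [50.25, 55.0]}, index=[0, 1]
)
pct = pd.DataFrame(
    {"Employed": [30.0, 10.5], "Unemployed": [20.0, 5.0]}, index=[0, 1]
)

# Format every cell as text, one column at a time
risk_text = risk.apply(lambda col: col.map("{:.2f}".format))
pct_text = pct.apply(lambda col: col.map("{:.1f}%".format))

# String frames with matching labels concatenate cell by cell
labels = risk_text + "\n(" + pct_text + ")"
print(labels.loc[0, "Employed"])
```

The resulting `labels` frame can be passed as `annot=labels` together with `fmt=""`, in the same way as the loop-built annotations.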